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Abstract 

Today, many different probabilistic programming languages exist and even more inference 
mechanisms for these languages. Still, most logic programming based languages use back- 
ward reasoning based on SLD resolution for inference. While these methods are typically 
computationally efficient, they often can neither handle infinite and/or continuous distri- 
butions, nor evidence. To overcome these limitations, we introduce distributional clauses, 
a variation and extension of Sato's distribution semantics. We also contribute a novel ap- 
proximate inference method that integrates forward reasoning with importance sampling, 
a well-known technique for probabilistic inference. To achieve efficiency, we integrate two 
logic programming techniques to direct forward sampling. Magic sets are used to focus on 
relevant parts of the program, while the integration of backward reasoning allows one to 
identify and avoid regions of the sample space that are inconsistent with the evidence. 



1 Introduction 



The advent of statistical relational learning (Getoor and Taskar 2007 De Raedt 



et al. 2008) and probabilistic programming (De Raedt et al. 2008) has resulted in a 
vast number of different languages and systems such as PRISM ( [Sato and Kameya 
IGL ([Poole 2008|), ProbLog (|De Raedt et al. 2007[), Dyna ([Eisner et al 



2001 



2005 



BLPs (Kersting and De Raedt 2008), GLP(i3A^) ( Santos Gosta et al. 2008) 



BLOG (Milch et al. 2005), Ghurch (Goodman et al. 20081, IBAL (Pfeffer 2001) 



and MLNs (Richardson and Domingos 20061. While inference in these languages 



generally involves evaluating the probability distribution defined by the model, often 
conditioned on evidence in the form of known truth values for some atoms, this 
diversity of systems has led to a variety of inference approaches. Languages such as 
IBAL, BLPs, MLNs and CIjV{BN) combine knowledge-based model construction 
to generate a graphical model with standard inference techniques for such models. 
Some probabilistic programming languages, for instance BLOG and Ghurch, use 
sampling for approximate inference in generative models, that is, they estimate 
probabilities from a large number of randomly generated program traces. Finally, 
probabilistic logic programming frameworks such as IGL, PRISM and ProbLog, 
combine SLD-resolution with probability calculations. 

So far, the second approach based on sampling has received little attention in 
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logic programming based systems. In this paper, we investigate the integration 
of sampUng-based approaches into probabihstic logic programming frameworks to 
broaden the applicability of these. Particularly relevant in this regard are the abil- 
ity of Church and BLOG to sample from continuous distributions and to answer 
conditional queries of the form p{q\e) where e is the evidence. To accommodate 
(continuous and discrete) distributions, we introduce distributional clauses, which 
define random variables together with their associated distributions, conditional 
upon logical predicates. Random variables can be passed around in the logic pro- 
gram and the outcome of a random variable can be compared with other values by 
means of special built-ins. To formally establish the semantics of this new construct, 
we show that these random variables define a basic distribution over facts (using 



the comparison built-ins) as required in Sato's distribution semantics (Sato 1995), 
and thus induces a distribution over least Herbrand models of the program. This 
contrasts with previous instances of the distribution semantics in that we no longer 
enumerate the probabilities of alternatives, but instead use arbitrary densities and 
distributions. 



From a logic programming perspective, BLOG (Milch et al. 2005) and related 
approaches perform forward reasoning, that is, the samples needed for probability 
estimation are generated starting from known facts and deriving additional facts, 
thus generating a possible world. PRISM and related approaches follow the opposite 
approach of backward reasoning, where inference starts from a query and follows 
a chain of rules backwards to the basic facts, thus generating proofs. This differ- 
ence is one of the reasons for using sampling in the first approach: exact forward 
inference would require that all possible worlds be generated, which is infeasible in 
most cases. Based on this observation, we contribute a new inference method for 
probabilistic logic programming that combines sampling-based inference techniques 
with forward reasoning. On the probabilistic side, the approach uses rejection sam- 



pling (Koller and Friedman 2009), a well-known sampling technique that rejects 



samples that are inconsistent with the evidence. On the logic programming side. 



we adapt the magic set technique (Bancilhon et al. 1986) towards the probabilistic 



setting, thereby combining the advantages of forward and backward reasoning. Fur- 
thermore, the inference algorithm is improved along the lines of the SampleSearch 



algorithm (Gogate and Dechter 2011 ), which avoids choices leading to a sample that 
cannot be used in the probability estimation due to inconsistency with the evidence. 
We realize this using a heuristic based on backward reasoning with limited proof 
length, the benefit of which is experimentally confirmed. This novel approach to 
inference creates a number of new possibilities for applications of probabilistic logic 
programming systems, including continuous distributions and Bayesian inference. 



This paper is organized as follows: we start by reviewing the basic concepts in 
Section[2] Section[3]introduces the new language and its semantics, Section|4]a novel 
forward sampling algorithm for probabilistic logic programs. Before concluding, we 
evaluate our approach in Section [5] 
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2 Preliminaries 

2.1 Probabilistic Inference 

A discrete probabilistic model defines a probability distribution p(-) over a set J7 of 
basic outcomes, that is, value assignments to the model's random variables. This 
distribution can then be used to evaluate a conditional probability distribution 
p{q\e) = I J. ; also called target distribution. Here, g is a query involving random 
variables, and e is the evidence^ that is, a partial value assignment of the random 
variableaM Evaluating this target distribution is called probabilistic inference (Roller 



and Friedman 2009 1 . In probabilistic logic programming, random variables often 



correspond to ground atoms, and p{-) thus defines a distribution over truth value 



assignments, as we will see in more detail in Sec. 2.3 (but see also De Raedt et al 



20081. Probabilistic inference then asks for the probability of a logical query being 
true given truth value assignments for a number of such ground atoms. 

In general, the probability p{-) of a query q is in the discrete case the sum over 
those outcomes a; € fi that are consistent with the query. In the continuous case, 
the sum is replaced by an (multidimensional) integral and the distribution p{-) by 
a (product of) densities F(-) That is, 

P(.q)^^P{^)M^): and p{q) ^ I ■ ■ ■ I tg{uj)dF{uj) (1) 



where 'i-q{oj) = I ii uj \= q and otherwise. As common (e.g. (Wasserman 2003)) 
we will use for convenience the notation J xdF{x) as unifying notation for both 
discrete and continuous distributions. 

As J7 is often very large or even infinite, exact inference based on the summation 
in (fl]) quickly becomes infeasible, and inference has to resort to approximation 
techniques based on samples, that is, randomly drawn outcomes w G f2. Given a 
large set of such samples {si, . . . , sn} drawn from p{-), the probability p{q) can be 
estimated as the fraction of samples where q is true. If samples are instead drawn 
from the target distribution p(-|e), the latter can directly be estimated as 

1 ^ 

However, sampling from p(-|e) is often highly inefficient or infeasible in practice, 
as the evidence needs to be taken into account. For instance, if one would use 
the standard definition of conditional probability to generate samples from p(-), all 
samples that are not consistent with the evidence do not contribute to the estimate 
and would thus have to be discarded or, in sampling terminology, refected. 

More advanced sampling methods therefore often resort to a so-called proposal 
distribution which allows for easier sampling. The error introduced by this sim- 
plification then needs to be accounted for when generating the estimate from the 



^ Ii e contains assignments to continuous variables, p(e) is zero. Hence, evidence on continuous 
values has to be defined via a probability density function, also called a sensor model. 
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set of samples. An example for such a method is importance sampling, where each 
sample Si has an associated weight Wi. Samples are drawn from an importance dis- 
tribution 7r(-|e), and weights are defined as Wi = fTj-r^- The true target distribution 
can then be estimated as 



1 ^ 

i—l 

where W — ^^ Wi is a normalization constant. The simplest instance of this algo- 
rithm is rejection sampling as already sketched above, where the samples are drawn 
from the prior distribution p{-) and weights are 1 for those samples consistent with 
the evidence, and for the others. Especially for evidence with low probability, 
rejection sampling suffers from a very high rejection rate, that is, many samples are 
generated, but do not contribute to the final estimate. This is known as the rejection 
problem. One way to address this problem is likelihood weighted sampling, which 
dynamically adapts the proposal distribution during sampling to avoid choosing 
values for random variables that cause the sample to become inconsistent with the 
evidence. Again, this requires corresponding modifications of the associated weights 
in order to produce correct estimates. 

2.2 Logical Inference 

A (definite) clause is an expression of the form h :- bi, . . . , bn, where h is called head 
and bi, . . . ,bn is the body. A program consists of a set of clauses and its semantics 
is given by its least Herbrand model. There are at least two ways of using a definite 
clause in a logical derivation. First, there is backward chaining, which states that 
to prove a goal h with the clause it suffices to prove bi, . . . ,bn; second, there is 
forward chaining, which starts from a set of known facts bi , . . . , b^ and the clause 



and concludes that h also holds (cf. (Nilsson and Maliszynski 1996)). Prolog em 



ploys backward chaining (SLD-resolution) to answer queries. SLD-resolution is very 
efficient both in terms of time and space. However, similar subgoals may be derived 
multiple times if the query contains recursive calls. Moreover, SLD-resolution is not 
guaranteed to always terminate (when searching depth- first) . Using forward reason- 
ing, on the other hand, one starts with what is known and employs the immediate 
consequence operator Tp until a fixpoint is reached. This fixpoint is identical to the 
least Herbrand model. 

Definition 1 (Tp operator). Let P be a logic program containing a set of definite 
clauses and ground{P) the set of all ground instances of these clauses. Starting 
from, a set of ground facts S the Tp operator returns 

Tp{S) = {h I h ;- bi, . . . , bn G ground(P) and {bi, . . . , bn} C S} 
2.3 Distribution Semantics 



Sato's distribution semantics (Sato 1995) extends logic programming to the prob- 



abilistic setting by choosing truth values of basic facts randomly. The core of this 
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semantics lies in splitting the logic program into a set F of facts and a set R of 
rules. Given a probability distribution Pp over the facts, the rules then allow one 
to extend Pp into a distribution over least Herbrand models of the logic program. 
Such a Herbrand model is called a possible world. 

More precisely, it is assumed that DB = FU R is ground and denumerable, and 
that no atom in F unifies with the head of a rule in R. Each truth value assignment 
to F gives rise to a unique least Herbrand model of DB. Thus, a probability distribu- 
tion Pp over F can directly be extended into a distribution Pdb over these models. 
Furthermore, Sato shows that, given an enumeration /i, /2, • • ■ of facts in F, Pp 
can be constructed from a series of finite distributions Pp" (/i — xi, . . . , fn — Xn) 
provided that the series fulfills the so-called compatibility condition, that is, 

Pp ifl = Xi, . . . , fn = Xn) = 2_^ Pp (/l — 2;i, . . . , fn+l — a^n+l) (2) 

3 Syntax and Semantics 

Sato's distribution semantics, as summarized in Sec.|2.3[ provides the basis for most 



probabilistic logic programming languages including PRISM (Sato and Kameya 



2001), ICL (Poole 2008), CP-logic (Vennekens et al. 2009) and ProbLog (DeRaedt 



et al. 2007). The precise way of defining the basic distribution Pp differs among 
languages, though the theoretical foundations are essentially the same. The most 
basic instance of the distribution semantics, employed by ProbLog, uses so-called 
probabilistic facts. Each ground instance of a probabilistic fact directly corresponds 
to an independent random variable that takes either the value "true" or "false". 



These probabilistic facts can also be seen as binary switches, cf. (Sato 1995 1, which 
again can be extended to multi-ary switches or choices as used by PRISM and ICL. 
For switches, at most one of the probabilistic facts belonging to the switch is "true" 
according to the specified distribution. Finally, in CP-logic, such choices are used 
in the head of rules leading to the so-called annotated disjunction. 



Hybrid ProbLog (Gutmann et al. 2010) extends the distribution semantics with 



continuous distributions. To allow for exact inference. Hybrid ProbLog imposes 
severe restrictions on the distributions and their further use in the program. Two 
sampled values, for instance, cannot be compared against each other. Only compar- 
isons that involve one sampled value and one number constant are allowed. Sampled 
values may not be used in arithmetic expressions or as parameters for other distri- 
butions, for instance, it is not possible to sample a value and use it as the mean of 
a Gaussian distribution. It is also not possible to reason over an unknown number 



of objects as BLOC (Milch et al. 2005) does, though this is the case mainly for 
algorithmic reasons. 

Here, we alleviate these restrictions by defining the basic distribution Pp over 
probabilistic facts based on both discrete and continuous random variables. We use 
a three-step approach to define this distribution. First, we introduce explicit ran- 
dom variables and corresponding distributions over their domains, both denoted by 
terms. Second, we use a mapping from these terms to terms denoting (sampled) 
outcomes, which, then, are used to define the basic distribution Pp on the level of 



6 B. Gutmann, I. Thon, A. Kimmig, M. Bruynooghe, L. De Raedt 

probabilistic facts. For instance, assume that an urn contains an unknown number 
of balls where the number is drawn from a Poisson distribution and we say that this 
urn contains many balls if it contains at least 10 balls. We introduce a random vari- 
able number, and we define many :- dist_gt(~(number), 9). Here, ~(number) is the 
Herbrand term denoting the sampled value of number, and dist_gt(~(number), 9) 
is a probabilistic fact whose probability of being true is the expectation that this 
value is actually greater than 9. This probability then carries over to the derived 
atom many as well. We will elaborate on the details in the following. 



3. 1 Syntax 

In a logic program, following Sato, we distinguish between probabilistic facts, which 
are used to define the basic distribution, and rules, which are used to derive addi- 
tional atomsjj Probabilistic facts are not allowed to unify with any rule head. The 
distribution over facts is based on random variables, whose distributions we define 
through so called distributional clauses. 

Definition 2 (Distributional clause). A distributional clause is a definite clause 
with an atom h ^ V in the head where '^ is a binary predicate used in infix notation. 

For each ground instance (h ~ I? :- bi, . . . ,bn)6' with 9 being a substitution 
over the Herbrand universe of the logic program, the distributional clause de- 
fines a random variable hd and an associated distribution Vd. In fact, the dis- 
tribution is only defined when (bi, . . . ,h^)9 is true in the semantics of the logic 
program. These random variables are terms of the Herbrand universe and can 
be used as any other term in the logic program. Furthermore, a term ~ (d) con- 
structed from the reserved functor ~/l represents the outcome of the random vari- 
able d. These functors can be used inside calls to special predicates in distjrel = 
{dist_eq/2,distJt/2,distJeq/2,dist_gt/2,dist_geq/2}. We assume that there is a 
fact for each of the ground instances of these predicate calls. These facts are the 
probabilistic facts of Sato's distribution semantics. Note that the set of probabilistic 
facts is enumerable as the Herbrand universe of the program is enumerable. A term 
—{d) links the random variable d with its outcome. The probabilistic facts compare 
the outcome of a random variable with a constant or the outcome of another ran- 
dom variable and succeed or fail according to the probability distribution(s) of the 
random variable(s). 

Example 1 (Distributional clauses). 

nballs ^ poisson(6). (3) 

color(B) -- [0.7 : b, 0.3 : g] ;- between(l,~(nballs),B). (4) 
diameter(B,MD) -- gamma(MD/20, 20) ;- between(l, ~(nballs), B), 

mean_diameter(~(color(B)),MD). (5) 



A rule can have an empty body, in which case it represents a deterministic fact. 
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The defined distributions depend on the following logical clauses: 

mean_diameter(C, 5) ;- dist_eq(C, b). 
mean_dicmieter(C, 10) ;- dist_eq(C, g). 
between(l, J, I) ;- dist_leq(l, J). 
betweeii(l, J,K) ;- dist_lt(l, J), II is I + 1, between(ll, J, K). 

The distributional clause (|3| models the number of balls as a Poisson distribution 
with mean 6. The distributional clause Q models a discrete distribution for the 
random variable color(B). With probability 0.7 the ball is blue and green otherwise. 
Note that the distribution is defined only for the values B for which 6etween(l,~ 
(nballs), B) succeeds. Execution of calls to the latter give rise to calls to probabilistic 
facts that are instances of distJeq{I , c^{nballs)) anddistJ,t[I,'^[nballs)). Similarly, 
the distributional clause ([5| defines a gamma distribution that is also conditionally 
defined. Note that the conditions in the distribution depend on calls of the form, 
meanjliameteT{p±[color{n)),MD) with n a value returned by between/3. Execution 
of this call finally leads to calls dist_eq{'^{color{n)),b) and dist_eq{c:i{color(n)), g). 

It looks feasible, to allow ~ (d) terms everywhere and to have a simple pro- 
gram analysis insert the special predicates in the appropriate places by replacing 
< /2, > /2, < /2, > /2 predicates by distjrel/2 facts. Though extending unifica- 
tion is a bit harder: as long as a —{h) term is unified with a free variable, standard 
unification can be performed; only when the other term is bound an extension is re- 
quired. In this paper, we assume that the special predicates dist_eq/2, dist_lt/2, 
dist_leq/2, dist_gt/2, and dist_geq/2 are used whenever the outcome of a ran- 
dom variable need to be compared with another value and that it is safe to use 
standard unification whenever a —(h) term is used in another predicate. 

For the basic distribution on facts to be well-defined, a program has to fulfill a 
set of validity criteria that have to be enforced by the programmer. 

Definition 3 (Valid program). A program P is called valid if: 

(VI) In the relation h ~ I? that holds in the least fixpoint of a program, there is 
a functional dependency from h to V , so there is a unique ground distribution T> 
for each ground random variable h. 

(V2) The program is distribution-stratified, that is, there exists a function rank(-) 
that maps ground atoms to N and that satisfies the following properties: (1) for 
each ground instance of a distribution clause h ^ 2? .■- bi, . . .bj, holds rankih. ~ 
T> > ranfc (bi) (for all i). (2) for each ground instance of another program clause: 
h ;- bi, . . . bn holds rank{h) > ranfc(bi) (for all i). (3) for each ground atom b 
that contains (the name of) a random variable h, ranfc(b) > ranfc(h ~ T>) (with 
h ^ T> the head of the distribution clause defining In). 

(V3) All ground probabilistic facts or, to be more precise, the corresponding indi- 
cator functions are Lebesguc-measurable. 

(V4) Each atom in the least fixpoint can be derived from a finite number of prob- 



abilistic facts ('finite support condition (Sato 1995)) 
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Together, (VI) and (V2) ensure that a smgle basic distribution Pp over the 
probabihstic facts can be obtained from the distributions of individual random 
variables defined in P. The requirement (V3) is crucial. It ensures that the series 

(n) 

of distributions Pp needed to construct this basic distribution is well-defined. 
Finally, the number of facts over which the basic distribution is defined needs to be 
countable. This is true, as we have a finite number of constants and functors: those 
appearing in the program. 



3.2 Distribution Semantics 

We now define the series of distributions Pp where we fix an enumeration /i , /2 , . . . 
of probabilistic facts such that i < j => rank(fi) < rank{fj) where rank{-) is 
a ranking function showing that the program is distribution-stratified. For each 
predicate rel/2 e distjrel, we define an indicator function as follows: 

I^ (X X)^l^ ifrel(Xi,X2)istrue 
'■' ^' \o ifrel(Xi,X2) isfalse 

Furthermore, we set I^^i{Xi,X2) = 1.0 — I^^i{Xi,X2). We then use the expected 

(n) 

value of the indicator function to define probability distributions Pp over finite 
sets of ground facts /i, . . . , /„. Let {rvi, . . . rvm} be the set of random variables 
these n facts depend on, ordered such that if rank{rvi) < rank{rvj), then i < j 
(cf. (V2) in Definition l3| . Furthermore, let fi = reli{tii,ti2), Xj € {1,0}, and 
0^^ — {^{rvi)/Vi, . . . ,~(rUm)/Fm}. The latter replaces all evaluations of random 
variables on which the fi depend by variables for integration. 

P}"' (/l = XI, ...,/„ = X„) = E[I^^^^ (ill, il2), . . . , 0„ {tnl,tn2)] (7) 



Example 2 (Basic Distribution). Let /i, /2, • ■ • = rfisiJt(~(61), 3), dist_/i(~(62), ~ 
(61)), . . .. The second distribution in the series then is 

P^p\distJt{~{bl),3) = xi,distJt{~{b2),~{bl)) = X2) 
= E[/,7s,_i,(^(61), 3), /S^3t.it(^(&2), ^(W))] 

= j j {IlU_i^{Vl,3),Ill,,_^,{V2,Vl))dV,^{V\)dV,2{V2) 

By now we are able to prove the following proposition. 

Proposition 1. Let P be a valid program. P defines a probability measure Pp 
over the set of fixpoints of the Tp operator. Hence, P also defines for an arbitrary 
formula q over atoms in its Herbrand base the probability that q is true. 

Proof sketch. It suffices to show that the series of distributions Pp over facts 



(cf. ([7|)) is of the form that is required in the distribution semantics, that is, these 
are well-defined probability distributions that satisfy the compatibility condition, 
cf. (pi). This is a direct consequence of the definition in terms of indicator functions 
and the measurability of the underlying facts required for valid programs. D 
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3.3 Tp Semantics 

In the following, we give a procedural view onto the semantics by extending the 
Tp operator of Definition fll to deal with probabilistic facts dist_rel(ti, t2). To do 
so, we introduce a function ReadTable(-) that keeps track of the sampled val- 
ues of random variables to evaluate probabilistic facts. This is required because 
interpretations of a program only contain such probabilistic facts, but not the un- 
derlying outcomes of random variables. Given a probabilistic fact dist_rel(tl, t2), 
ReadTable returns the truth value of the fact based on the values of the random 
variables h in the arguments, which are either retrieved from the table or sampled 
according to their definition h ^ I? as included in the interpretation and stored in 
case they are not yet available. 

Definition 4 (Stochastic Tp operator). Let P be a valid program and ground{P) 
the set of all ground instances of clauses in P. Starting from a set of ground facts 
S the STp operator returns 

STp{S) '■= \i>- h ;- bi, . . . ,bii G ground{P) and V bi : either hi E S or 

(bi = dist_rel{tl,t2) A {tj =~(/i) ^ (h - 2?) G S)A 

READTABLE(fei) = true) > 

ReadTable ensures that the basic facts are sampled from their joint distribution 



as defined in Sec. |3.2| during the construction of the standard fixpoint of the logic 
program. Thus, each fixpoint of the STp operator corresponds to a possible world 
whose probability is given by the distribution semantics. 

4 Forward sampling using Magic Sets and backward reasoning 

In this section we introduce our new method for probabilistic forward inference. To 
this aim, we first extend the magic set transformation to distributional clauses. We 
then develop a rejection sampling scheme using this transformation. This scheme 
also incorporates backward reasoning to check for consistency with the evidence 
during sampling and thus to reduce the rejection rate. 

4-1 Probabilistic magic set transformation 

The disadvantage of forward reasoning in logic programming is that the search 
is not goal-driven, which might generate irrelevant atoms. The magic set trans- 



formation (Bancilhon et al. 1986 Nilsson and Maliszynski 1996) focuses forward 



reasoning in logic programs towards a goal to avoid the generation of uninteresting 
facts. It thus combines the advantages of both reasoning directions. 

Definition 5 (Magic Set Transformation). If P is a logic program, then we use 
Magic(P) to denote the smallest program such that if Aq :- Ai, . . . , A^ € P then 

• Ao ;- c(Ao),Ai,. . . , An € Magic(P) and 

• for each I < i <n: c(Ai) ;- c(Ao),Ai,. . . ,Ai_i e Magic(F) 
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The meaning of the additional c/1 atoms (c=call) is that they "switch on" clauses 
when they are needed to prove a particular goal. If the corresponding switch for the 
head atom is not true, the body is not true and thus cannot be proven. The magic 
transformation is both sound and complete. Furthermore, if the SLD-tree of a goal 
is finite, forward reasoning in the transformed program terminates. The same holds 
if forward reasoning on the original program terminates. 

We now extend this transformation to distributional clauses. The idea is that the 
distributional clause for a random variable h is activated when there is a call to a 
probabilistic fact distjrel(ti,t2) depending on h. 

Definition 6 (Probabilistic Magic Set Transformation). For program P, let Pl be 
P without distributional clauses. M(P) is the smallest program s.t. Magic(Pl) Q 
M(P) and for each h ^ V ;- bi, . . . ,bn G P and rel G {eq, It, leq, gt, geq}; 

• h^V :- (c(dist_rel(~(h),X));c(dist_rel(X, ~(h))),bi,...,bn. G M(P). 

• c(bi) ;- (c(dist_rel(~(h),X));c(dist_rel(X,~(h))),bi,...,bi_i. G M(P). 

Then PMagic(P) consists of: 

• a clause a_p(ti, . . . , tn) ;- c(p(ti, . . . , tn)),p(ti, . . . , tn) for each built-in pred- 
icate (including dist_rel/2 for rel G {eq. It, leq, gt, geq}j used in M(P). 

• a clause h :- h^, . . . , b^ for each clause h ;- b^, . . . , b^ G M(P) where h[ — a_h± 
if h± uses a built-in predicate and else h[ = bi . 

Note that every call to a built-in b is replaced by a call to a_b; the latter pred- 
icate is defined by a clause that is activated when there is a call to the built-in 
(c(b)) and that effectively calls the built-in. The transformed program computes 
the distributions only for random variables whose value is relevant to the query. 
These distributions are the same as those obtained in a forward computation of the 
original program. Hence we can show: 

Lemma 1. Let P be a program and PMagic(P) its probabilistic magic set trans- 
formation extended with a seed c(q). The distribution over q defined by P and by 
PMagic(P) is the same. 

Proof sketch. In both programs, the distribution over q is determined by the distri- 
butions of the atoms dist_eq{ti,t2), dist_leq{ti,t2), dist_lt{ti,t2), dist_geq{ti,t2), 
and dist_gt{ti, ^2) on which q depends in a forward computation of the program P. 
The magic set transformation ensures that these atoms are called in the forward 
execution of PMagic(P). In PMagic(P), a call to such an atom activates the dis- 
tributional clause for the involved random variable. As this distributional clause is 
a logic program clause, soundness and completeness of the magic set transformation 
ensures that the distribution obtained for that random variable is the same as in 
P. Hence also the distribution over q is the same for both programs. D 

4-2 Rejection sampling with heuristic lookahead 



As discussed in SectionjSrT] sampling-based approaches to probabilistic inference es- 
timate the conditional probability p{q\e) of a query q given evidence e by randomly 
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Algorithm 1 Main loop for sampling-based inference to calculate the conditional 
probability p{q\e) for query q, evidence e and program L. 

1: function Evaluate(L, q, e, Depth) 

2: L* :=PMAGic(L)U{c(a)|a e eUg} 

3: n+ :=: n- --0 

4: while Not converged do 

5: (/, w) :=STPMagic(L* , L, e, Depth) 

6: if g e / then n+ := n^ + w else n^ := n^ + w 

7: return n'^ /{n^ + n^) 

Algorithm 2 Sampling one interpretation; used in Algorithm [T] 
1: function STPMagic{L* , L,e, Depth) 
2: Tpf := 0, Tdis := 0, w := 1, /o;d := 0, /„eu; := 

3: repeat 

4* -'o/d ■ — ^new 

5: for all (h :- body) G i* do 

6: split body in Bpp (prob. facts) and Bl (the rest) 

7: for all grounding substitution 9 such that BpO <^ I old do 

8: s := true, Wd '■= 1 

9: while s A i?pi? 7^ do 

10: select and remove p/ from Bpp 

11: {bpf,Wpf) :=ReadTable(p/6', /o;^, Tpf,Td^s,L, e, Depi/i) 

12: s := s A 5^/ w^ := w^ • Wpf 

13: if s then 

14: if /16' e e^ then return {Inew, 0) > check negative evidence 

15: Inew ■= Inew U {/l6'} W := W ■ Wd 

16: until /„e^ = loid y w = > Fixpoint or impossible evidence 

17: if e+ C Inew then return {Inew,w) > check positive evidence 

18: else return {lnew,0) 

generating a large number of samples or possible worlds (cf. Algorithm nl) . The al- 
gorithm starts by preparing the program L for sampling by applying the PMagic 
transformation. In the following, we discuss our choice of subroutine STPMagic 
(cf. Algorithm [2]) which realizes likelihood weighted sampling. It is used in Algo- 
rithm[T} linejS] to generate individual samples. It iterates the stochastic consequence 
operator of Definition|4]until either a fixpoint is reached or the current sample is in- 
consistent with the evidence. Finally, if the sample is inconsistent with the evidence, 
it receives weight 0. 

Algorithm [3] details the procedure used in line [Tl] of Algorithm [5] to sample from 
a given distributional clause. The function ReadTable returns the truth value of 
the probabilistic fact, together with its weight. If the outcome is not yet tabled, 
it is computed. Note that false is returned when the outcome is not consistent 
with the evidence. Involved distributions, if not yet tabled, are sampled in line[5J In 
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Algorithm 3 Evaluating a probabilistic fact pf; used in Algorithm [2] COM- 
PiJTEPF{pf, Tdis) computes the truth value and the probability oi pf according 
to the information in T^is- 

1: function ReadTable^p/, I ,Tpf,Tdis, L,e, Depth) 

2: if pf ^ Tpf then 

3: for all random variable h occurring in pf where h ^ Tdis do 

4: if h ^ D ^ I then return {false, 0) 

5: if not Sample(/i, D, Tdis,I, L, e, Depth) then return {false, 0) 

6: (6,u;):=COMPUTEPF(p/,Td,,) 

7: if {bA{pf e e")) V {^b A {pf G e+)) then 

8: return {false, 0) > inconsistent with evidence 

9: extend Tpf with {pf, b, w) 

10: return {b, w) as stored in Tpf for pf 

11: procedure SAMPLE(h,I?, T^is, -^, i, e, Depth) 

12: ui^ := 1, D' :~ D > Initial weight, temp, distribution 

13: if 2?' = [pi : ai, . . . ,p„ : a^] then t> finite distribution 

14: for pj : aj G P' where dist_eq(h, aj) G e^ do > remove neg. evidence 

15: V := Norm(X>' \ {pj : aj}), Wh := w^ x (1 - pj) 

16: if 3i; : dist_eq(~(h), v) G e+ and p : v e V then 

17: P' := [1 : w], Wh -.^ Wh x p 

18: for pj : aj G P' do > remove choices that make e"*" impossible 

19: if 36 G e+: not MaybeProof(6, Depi/i,/ U {dist_eq(h, aj)},X) or 

20: 36 € e~: not MaybeFail(6, Dept/i, / U {dist_eq(h, aj)}, L) then 

21: V :^Norm{V'\ {pj -.a j}), Wh := Wh x {I ~ pj) 

22: if 2?' = return false 

23: Sample x according to V, extend Tdis with {h,x) and return true 



the infinite case, Sample simply returns the sampled value. In the finite case, it is 
directed towards generating samples that are consistent with the evidence. Firstly, 
all possible choices that are inconsistent with the negative evidence are removed. 
Secondly, when there is positive evidence for a particular value, only that value is left 
in the distribution. Thirdly, it is checked whether each left value is consistent with 
all other evidence. This consistency check is performed by a simple depth-bounded 
meta- interpreter. For positive evidence, it attempts a top-down proof of the evidence 
atom in the original program using the function MaybeProof. Subgoals for which 
the depth-bound is reached, as well as probabilistic facts that are not yet tabled 
are assumed to succeed. If this results in a proof, the value is consistent, otherwise 
it is removed. Similarly for negative evidence: in MaybeFail, subgoals for which 
the depth-bound is reached, as well as probabilistic facts that are not yet tabled 
are assumed to fail. If this results in failure, the value is consistent, otherwise 
it is removed. The Depth parameter allows one to trade the computational cost 
associated with this consistency check for a reduced rejection rate. 
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Note that the modified distribution is normahzed and the weight is adjusted in 
each of these three cases. The weight adjustment takes into account that removed 
elements cannot be sampled and is necessary as it can depend on the distributions 
sampled so far which elements are removed from the distribution sampled in Sam- 
ple (the clause bodies of the distribution clause are instantiating the distribution). 

5 Experiments 

We implemented our algorithm in YAP Prolog and set up experiments to answer 
the questions 

Ql Does the lookahead-based sampling improve the performance? 
Q2 How do rejection sampling and likelihood weighting compare? 

To answer the first question, we used the distributional program in Figure [T] 
which models an urn containing a random number of balls. The number of balls 
is uniformly distributed between 1 and 10 and each ball is either red or green 
with equal probability. We draw 8 times a ball with replacement from the urn 
and observe its color. We also define the atom nogreen(£') to be true if and 
only if we did not draw any green ball in draw 1 to D. We evaluated the query 
P(dist_eq(~ (color(~ (drawnball(l)))),red) |nogreen(L»)) for D ^ 1,2,..., 8. 
Note that the evidence implies that the first drawn ball is red, hence that the prob- 
ability of the query is 1; however, the number of steps required to proof that the 
evidence is inconsistent with drawing a green first ball increases with D, so the 
larger is D, the larger Depth is required to reach a 100% acceptance rate for the 
sample as illustrated in Figure [T] It is clear that by increasing the depth limit, 
each sample will take longer to be generated. Thus, the Depth parameter allows 
one to trade off convergence speed of the sampling and the time each sample needs 
to be generated. Depending on the program, the query, and the evidence there is 
an optimal depth for the lookahead. 



To answer Q2, we used the standard example for BLOG (Milch et al. 2005). An 
urn contains an unknown number of balls where every ball can be either green or 
blue with p — 0.5. When drawing a ball from the urn, we observe its color but do 
not know which ball it is. When we observe the color of a particular ball, there is a 
20% chance to observe the wrong one, e.g. green instead of blue. We have some prior 
belief over the number of balls in the urn. If 10 balls are drawn with replacement 
from the urn and we saw 10 times the color green, what is the probability that there 
are n balls in the urn? We consider two different prior distributions: in the first case, 
the number of balls is uniformly distributed between 1 and 8, in the second case, 
it is Poisson-distributed with mean A = 6. 

One run of the experiment corresponds to sampling the number N of balls in the 
urn, the color for each of the N balls, and for each of the ten draws both the ball 
drawn and whether or not the color is observed correctly in this draw. Once these 
values are fixed, the sequence of colors observed is determined. This implies that for 
a fixed number A^ of balls, there are 2^ ■ N^^ possible proofs. In case of the uniform 
distribution, exact PRISM inference can be used to calculate the probability for 
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numballs ~ unif orm([l, 2, 3, 4, 5, 6, 7, 8, 9, 10]). 
ball(M) :- between(l, numballs, M). 
color(B) ~ unif orm( [red, green]) :- ball(B). 
draw(N) :- between(l, 8, N). 
nogreen(O). 

nogreen(D) :- dist_eq(~ (color(~ (drawnball(D)))), red), D2 is D — 1, nogreen(D2) 
drawnball(D) ~ unif orm(L) :- draw(D),f indall(B,ball(B), L). 
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Fig. 1. The program modeling the urn (left); rate of accepted samples (right) for 
evaluating the query P(dist_eq(~ (color(~ (drawnball(l)))), red) | nogreen(D)) 
for _D = 1, 2, . . . , 10 and for Depth = 1, 2, . . . , 8 using AlgorithmjT] The acceptance 
rate is calculated by generating 200 samples using our algorithm and counting the 
number of sample with weight larger than 0. 
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Fig. 2. Results of urn experiment with forward reasoning. 10 balls with replacement 
were drawn and each time green was observed. Left: Uniform prior over # balls, 
right: Poisson prior (A = 6). 



each given number of balls, with a total runtime of 0.16 seconds for all eight cases. 
In the case of the Poisson distribution, this is only possible up to 13 balls, with more 
balls, PRISM runs out of memory. For inference using sampling, we generate 20,000 
samples with the uniform prior, and 100,000 with Poisson prior. We report average 
results over five repetitions. For these priors, PRISM generates 8,015 and 7,507 
samples per second respectively, ProbLog backward sampling 708 and 510, BLOG 
3,008 and 2,900, and our new forward sampling (with rejection sampling) 760 and 
731. The results using our algorithm for both rejection sampling and likelihood 
weighting with Depth = are shown in Figure [2j As the graphs show, the standard 
deviation for rejection sampling is much larger than for likelihood weighting. 
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6 Conclusions and related work 

We have contributed a novel construct for probabilistic logic programming, the dis- 
tributional clauses, and defined its semantics. Distributional clauses allow one to 
represent continuous variables and to reason about an unknown number of objects. 
In this regard this construct is similar in spirit to languages such as BLOG and 
Church, but it is strongly embedded in a logic programming context. This embed- 
ding allowed us to propose also a novel inference method based on a combination 
of importance sampling and forward reasoning. This contrasts with the majority of 
probabilistic logic programming languages which are based on backward reasoning 



(possibly enhanced with tabling ( Sato and Kameya 2001 Mantadelis and Janssens 



2010)). Furthermore, only few of these techniques employ sampling, but see (Kim- 



mig et al. 2011 ) for a Monte Carlo approach using backward reasoning. Another 
key difference with the existing probabilistic logic programming approaches is that 
the described inference method can handle evidence. This is due to the magic set 
transformation that targets the generative process towards the query and evidence 
and instantiates only relevant random variables. 



P-log ( Baral et al. 2009 1 is a probabilistic language based on Answer Set Prolog 
(ASP). It uses a standard ASP solver for inference and thus is based on forward 
reasoning, but without the use of sampling. Magic sets are also used in proba- 



bilistic Datalog (Fuhr 2000), as well as in Dyna, a probabilistic logic programming 



language ( Eisner et al. 2005 ) based on rewrite rules that uses forward reasoning. 
However, neither of them uses sampling. Furthermore, Dyna and PRISM require 
that the exclusive-explanation assumption. This assumption states that no two dif- 
ferent proofs for the same goal can be true simultaneously, that is, they have to rely 
on at least one basic random variable with different outcome. Distributional clauses 
(and the ProbLog language) do not impose such a restriction. Other related work 



includes MCMC-based sampling algorithms such as the approach for SLP (An 



gelopoulos and Cussens 2003]). Church's inference algorithm is based on MCMC 



too, and also BLOG is able to employ MCMC. At least for BLOG it seems to be 
required to define a domain-specific proposal distribution for fast convergence. With 
regard to future work, it would be interesting to consider evidence on continuous 
distributions as it is currently restricted to finite distribution. Program analysis and 
transformation techniques to further optimize the program w.r.t. the evidence and 
query could be used to increase the sampling speed. Finally, the implementation 
could be optimized by memoizing some information from previous runs and then 
use it to more rapidly prune as well as sample. 
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